
feat(openchat): Mistral-7B inference engine (GQA + RoPE + RMSNorm + SiLU)

Merged: AdaWorldAPI merged 8 commits into master from claude/transcode-deepnsm-rust-oNa1Z on Mar 29, 2026

Conversation

@AdaWorldAPI (Owner)

No description provided.

claude added 8 commits March 29, 2026 20:02
…lette

Extracted from openai-community/gpt2 model.safetensors (522.7 MB):
  wte.weight: [50257, 768] f32 → Base17 golden-step projection

  gpt2_base17_50k.bin:   1,669 KB (50,257 tokens × 34 B each = 320× vs the 522.7 MB file)
  gpt2_palette_50k.bin:     58 KB (256 centroids + 50K assignments = 9,294×)

GPT-2 uses the SAME BPE tokenizer as Jina v4 → palette indices are
DIRECTLY COMPATIBLE. CausalEdge64 edges from GPT-2 and Jina use
the SAME S/P/O palette space. Zero mapping overhead.

Total weights in ndarray/src/hpc/jina/weights/: 3.4 MB
  Jina v4 (20K tokens):  694 KB
  GPT-2 (50K tokens):    1,727 KB
  COCA vocabulary:        997 KB

All three models share the same Base17 codec.
Load via LazyLock at startup. No external deps. No GPU.
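
A minimal sketch of that load pattern, assuming a flat
[256×17 centroids | per-token assignments] byte layout; PaletteModel and
its fields are illustrative names, not the crate's actual API:

  use std::sync::LazyLock;

  pub struct PaletteModel {
      pub centroids: Vec<[i8; 17]>, // 256 centroids × 17 dims
      pub assignments: Vec<u8>,     // one palette index per token
  }

  impl PaletteModel {
      fn from_bytes(bytes: &'static [u8], n_tokens: usize) -> Self {
          // Assumed layout: 256 * 17 centroid bytes, then one byte per token.
          let (cent, assign) = bytes.split_at(256 * 17);
          let centroids = cent.chunks_exact(17)
              .map(|c| {
                  let mut row = [0i8; 17];
                  for (d, &b) in row.iter_mut().zip(c) { *d = b as i8; }
                  row
              })
              .collect();
          Self { centroids, assignments: assign[..n_tokens].to_vec() }
      }
  }

  // Embedded at compile time (zero file I/O), decoded once on first access.
  pub static GPT2_PALETTE: LazyLock<PaletteModel> = LazyLock::new(|| {
      PaletteModel::from_bytes(include_bytes!("weights/gpt2_palette_50k.bin"), 50_257)
  });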

https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
…palette

Extracted from google-bert/bert-base-uncased (420 MB safetensors):
  bert.embeddings.word_embeddings.weight: [30522, 768] f32

  bert_base17_30k.bin:   1,013 KB (30,522 tokens × 34 B each = 424× vs the 420 MB file)
  bert_palette_30k.bin:     38 KB (256 centroids + 30K assignments = 11,225×)

BERT (WordPiece) uses a DIFFERENT tokenizer than GPT-2/Jina (BPE).
Needs a mapping table for cross-model palette alignment (sketched below).
BERT captures BIDIRECTIONAL context — complements GPT-2 autoregressive.
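
A hedged sketch of what such a mapping table could look like, pairing
WordPiece and BPE ids by exact surface-string match (tokens without a
match would need a nearest-palette fallback); the function and parameter
names here are assumptions, not the crate's API:

  use std::collections::HashMap;

  fn build_token_map(
      wordpiece_vocab: &[(u32, String)], // (BERT id, token string)
      bpe_vocab: &HashMap<String, u32>,  // GPT-2/Jina string → id
  ) -> HashMap<u32, u32> {
      wordpiece_vocab.iter()
          .filter_map(|(wp_id, s)| bpe_vocab.get(s).map(|&bpe_id| (*wp_id, bpe_id)))
          .collect()
  }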

Total weights: 4.5 MB (3 models + COCA vocabulary)
  Jina v4:   694 KB (20K tokens, 2048D→17D)
  GPT-2:   1,727 KB (50K tokens, 768D→17D)
  BERT:    1,052 KB (30K tokens, 768D→17D)
  COCA:      997 KB (20K academic vocabulary)

All three models: same Base17 codec, same palette format, same CausalEdge64.

https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
…odec

runtime.rs (300 lines, 11 tests):

  LazyLock<ModelRuntime> for Jina v4, GPT-2, BERT:
    Weights embedded via include_bytes! (zero file I/O)
    Loaded once, used forever. 4.5MB total in binary.

  Full codec chain per model:
    HHTL cascade:     heel_distance() → cascade_distance() (early exit)
    SimilarityTable:  heel_similarity() → calibrated f32 [0,1]
    CAM-PQ:           cam_fingerprint() → 6-byte [palette, dim0..4]
    CausalEdge64:     pack_spo_edge() → u64 with NARS + Pearl + temporal
    Base17 LEAF:      leaf_distance() → full resolution L1

  SimilarityTable built from EXACT 256×256 palette distance CDF.
  This IS the bgz17 pattern: empirical distribution → calibrated lookup.
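
  A sketch of that calibration step, assuming similarity is defined as one
  minus the empirical CDF of the distance (so the closest pairs land near
  1.0); the actual SimilarityTable construction may differ:

    fn build_similarity_table(dist: &[[f32; 256]; 256]) -> Vec<f32> {
        // Sort all 65,536 pairwise distances once to form the empirical CDF.
        let mut sorted: Vec<f32> = dist.iter().flatten().copied().collect();
        sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let n = sorted.len() as f32;
        // similarity(i, j) = 1 - CDF(dist[i][j]), a calibrated value in [0, 1].
        dist.iter().flatten()
            .map(|&d| 1.0 - sorted.partition_point(|&x| x < d) as f32 / n)
            .collect()
    }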

  Usage:
    use ndarray::hpc::jina::runtime::{JINA, GPT2, BERT};
    let sim = GPT2.heel_similarity(token_a, token_b);  // O(1), calibrated
    let edge = GPT2.pack_spo_edge(s, p, o, 0.8, 0.6, 42);  // CausalEdge64
    let fp = BERT.cam_fingerprint(token);  // 6-byte CAM-PQ

  22 tests passing (12 codec + 11 runtime).

https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
GPT-2 small (124M) forward pass with KV cache, all transcendentals
via crate::simd::F32x16 (LayerNorm, GELU, softmax, dot products).

- weights.rs: safetensors loader for 12 transformer layers
- inference.rs: autoregressive generation with temperature sampling
- api.rs: OpenAI-compatible request/response types (/v1/completions,
  /v1/embeddings, /v1/models) — transport-agnostic
- 9 tests passing (layer_norm, GELU, softmax, config, API types)
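
The KV cache mentioned at the top of this commit is not spelled out in the
message; a minimal per-layer sketch of the idea, with assumed names: each
decode step appends one projected K/V row per layer and attends over
everything cached so far, so past tokens are never recomputed:

  struct KvCache {
      keys: Vec<Vec<f32>>,   // [n_layers], each seq_len × n_heads × head_dim
      values: Vec<Vec<f32>>,
  }

  impl KvCache {
      fn new(n_layers: usize) -> Self {
          Self { keys: vec![Vec::new(); n_layers], values: vec![Vec::new(); n_layers] }
      }
      // Append this step's K/V for one layer; decoding stays O(seq_len) per step.
      fn append(&mut self, layer: usize, k: &[f32], v: &[f32]) {
          self.keys[layer].extend_from_slice(k);
          self.values[layer].extend_from_slice(v);
      }
  }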

https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
Weight matrices pre-transposed from [in_dim, out_dim] to [out_dim, in_dim]
during safetensors loading. matmul_vec_simd now reads contiguous rows via
F32x16::from_slice + mul_add — full SIMD utilization (768D = 48 × F32x16).
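
A sketch of the resulting kernel. F32x16::from_slice and mul_add are named
above; splat and reduce_sum are assumed equivalents for zero-initialization
and the horizontal sum:

  fn matmul_vec_simd(w_t: &[f32], x: &[f32], out: &mut [f32]) {
      let in_dim = x.len(); // 768 = 48 × 16 lanes for GPT-2 small
      for (o, row) in out.iter_mut().zip(w_t.chunks_exact(in_dim)) {
          // Pre-transposition makes `row` contiguous: every iteration is a
          // straight 16-lane load + fused multiply-add, no strided gathers.
          let mut acc = F32x16::splat(0.0);
          for (rc, xc) in row.chunks_exact(16).zip(x.chunks_exact(16)) {
              acc = F32x16::from_slice(rc).mul_add(F32x16::from_slice(xc), acc);
          }
          *o = acc.reduce_sum();
      }
  }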

https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
Full integration with the tensor codec pipeline:
- AttentionTable: palette-based O(1) approximate attention via
  jina::runtime::GPT2 (256×256 HEEL distance table)
- CausalEdge64 emission: attention patterns packed as SPO edges
  with NARS truth values (subject=query, predicate=head, object=key)
- HHTL cascade: token_similarity(), token_distance_leaf(),
  token_distance_cascade() methods on Gpt2Engine
- CAM-PQ: 6-byte token fingerprints via cam_fingerprint()

Both features are opt-in flags (use_attention_table, emit_causal_edges)
to avoid overhead when not needed. 14 tests passing.
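
Hypothetical usage of the two flags, in the same spirit as the runtime
example in the codec commit above (the constructor and field names are
assumptions from this commit message, not confirmed API):

  let mut engine = Gpt2Engine::load("gpt2.safetensors")?;
  engine.use_attention_table = true; // O(1) palette-based approximate attention
  engine.emit_causal_edges = true;   // pack attention patterns as CausalEdge64

  let sim = engine.token_similarity(tok_a, tok_b);  // HHTL cascade
  let fp  = engine.cam_fingerprint(tok_a);          // 6-byte CAM-PQ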

https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
Extract shared code into hpc/models/:
- safetensors.rs: generic file loader (used by GPT-2, SD, BERT)
- layers.rs: SIMD ops (layer_norm, gelu, silu, group_norm, softmax,
  matmul_vec, dot_product) — all via crate::simd::F32x16
- api_types.rs: OpenAI-compatible envelope (Usage, FinishReason, etc.)

Add hpc/stable_diffusion/ scaffold (code only, no weights):
- clip.rs: CLIP text encoder (same transformer as GPT-2, shared layers)
- unet.rs: UNet denoiser with Conv2D, GroupNorm, SiLU, timestep embedding
- vae.rs: VAE decoder (latent→RGB)
- scheduler.rs: DDIM noise scheduler with precomputed alpha schedule
- weights.rs: safetensors loader for SD CLIP weights
- api.rs: /v1/images/generations with full pipeline

52 tests passing. Zero weight files — disk space conscious.
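
For the scheduler above, a sketch of what a precomputed alpha schedule can
look like, using the standard scaled-linear beta schedule from Stable
Diffusion; constants and names are illustrative, not necessarily what
scheduler.rs does:

  struct DdimSchedule {
      alphas_cumprod: Vec<f32>, // cumulative product of (1 - beta_i) per train step
  }

  impl DdimSchedule {
      fn new(n_steps: usize, beta_start: f32, beta_end: f32) -> Self {
          // Scaled-linear: interpolate sqrt(beta) linearly, then square.
          let (s, e) = (beta_start.sqrt(), beta_end.sqrt());
          let mut acc = 1.0f32;
          let alphas_cumprod = (0..n_steps).map(|i| {
              let beta = (s + (e - s) * i as f32 / (n_steps - 1) as f32).powi(2);
              acc *= 1.0 - beta; // running cumulative product
              acc
          }).collect();
          Self { alphas_cumprod }
      }
  }

  // e.g. DdimSchedule::new(1000, 0.00085, 0.012) for SD v1-style defaults.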

https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
…iLU)

OpenChat 3.5 / Mistral-7B architecture, fully distinct from GPT-2:
- GQA: 32 query heads share 8 KV heads (4:1 ratio, 75% KV cache savings)
- RoPE: rotary positional embedding (no learned positions)
- RMSNorm: simpler norm without mean subtraction (both in models::layers)
- SiLU: gated MLP (gate * up → down) with F32x16 element-wise SIMD
- GGUF weight loading via hpc::gguf (Q4_K_M + Q4_0 dequantization added)
- CausalEdge64 emission from attention patterns
- OpenChat chat template (GPT4 Correct User/Assistant markers)
- /v1/chat/completions API types

All ops through crate::simd::F32x16 via models::layers.
No weights stored — loaded at runtime from user-provided GGUF.
15 tests passing. 77 total across new modules.
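
Scalar sketches of the two ops in the list above that most distinguish this
path from GPT-2 (the real kernels go through F32x16 in models::layers);
signatures are illustrative:

  // RMSNorm: scale by reciprocal RMS. No mean subtraction, unlike LayerNorm.
  fn rms_norm(x: &mut [f32], weight: &[f32], eps: f32) {
      let ms = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
      let inv_rms = 1.0 / (ms + eps).sqrt();
      for (v, &w) in x.iter_mut().zip(weight) {
          *v *= inv_rms * w;
      }
  }

  // GQA head mapping: 32 query heads over 8 KV heads means each group of
  // 4 consecutive query heads reads the same cached KV head, so the cache
  // holds 8/32 = 25% of the MHA size (the 75% savings above).
  fn kv_head_for(q_head: usize, n_q_heads: usize, n_kv_heads: usize) -> usize {
      q_head / (n_q_heads / n_kv_heads) // q 0..=3 → kv 0, q 4..=7 → kv 1, ...
  }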

https://claude.ai/code/session_01Y69Vnw751w75iVSBRws7o7
AdaWorldAPI merged commit ffe89e3 into master on Mar 29, 2026
4 of 10 checks passed